CausNet: generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraints

Background Finding a globally optimal Bayesian Network using exhaustive search is a problem with super-exponential complexity, which severely restricts the number of variables that can feasibly be included. We implement a dynamic programming based algorithm with built-in dimensionality reduction and parent set identification. This reduces the search space substantially and can be applied to large-dimensional data. We use what we call ‘generational orderings’ based search for optimal networks, which is a novel way to efficiently search the space of possible networks given the possible parent sets. The algorithm supports both continuous and categorical data, as well as continuous, binary and survival outcomes. Results We demonstrate the efficacy of our algorithm on both synthetic and real data. In simulations, our algorithm performs better than three state-of-art algorithms that are currently used extensively. We then apply it to an Ovarian Cancer gene expression dataset with 513 genes and a survival outcome. Our algorithm is able to find an optimal network describing the disease pathway consisting of 6 genes leading to the outcome node in just 3.4 min on a personal computer with a 2.3 GHz Intel Core i9 processor with 16 GB RAM. Conclusions Our generational orderings based search for optimal networks is both an efficient and highly scalable approach for finding optimal Bayesian Networks and can be applied to 1000 s of variables. Using specifiable parameters—correlation, FDR cutoffs, and in-degree—one can increase or decrease the number of nodes and density of the networks. Availability of two scoring option—BIC and Bge—and implementation for survival outcomes and mixed data types makes our algorithm very suitable for many types of high dimensional data in a variety of fields.


Introduction
Optimal Bayesian network (BN) structure discovery is a method of learning Bayesian networks from data that has applications in wide variety of areas including epidemiology (see e.g. [1][2][3][4]). Disease pathways found using directed BN edges leading to a phenotype outcome can improve understanding, diagnosis and treatment of a disease. The main challenge in finding an optimal BN lies in the super-exponential complexity of the search [5]. Dynamic programming can reduce the complexity to exponential [6,7], but still the number of features/nodes feasibly explored remains very small-usually no more than 30-and only by restricting the maximum number of parents for each node [6,7]. To alleviate the challenge of high dimensionality, we implement a dynamic programming algorithm with parent set constraints. We use what we call 'generational orderings' based search for optimal networks, which is a novel way to efficiently search the space of possible networks given the possible parent sets.
Current algorithms typically do not accommodate both continuous and categorical nodes, and we were not able to find any that accommodate a survival outcome. We implement support for both continuous and categorical data, as well as continuous, binary and survival outcomes. This is especially useful for disease modeling where mixed data and survival outcomes are common. We also provide options for two common scoring functions, and allow for multiple best networks to be returned if there are ties.
Our main novel contribution in addition to providing software is the revision of the Silander algorithm 3 [6] to incorporate possible parent sets, and use of 'generational orderings' for a much more efficient way to explore the search space as compared to the original approach, which is based on lexicographical ordering. The proposed approach covers the entire constrained search space without searching through networks that don't conform to the parent set constraints.

Background
In this section, we briefly review Bayesian networks and the Bayesian network structure discovery problem (for more background on these topics see, for example, [8,9]).
A Bayesian network (BN) is a probabilistic graphical model that consists of a labeled directed acyclic graph (DAG) in which the vertices V = {v 1 , . . . , v p } correspond to random variables and the edges represent conditional dependence of one random variable on another. Each vertex v i is labeled with a conditional probability distribution P(v i |parents(v i )) that specifies the dependence of the variable v i on its set of parents parents(v i ) in the DAG. A BN can also be viewed as a factorized representation of the joint probability distribution over the random variables and as an encoding of conditional dependence and independence assumptions.
A Bayesian network G can be described as a vector G = (G 1 , . . . , G p ) of parent sets: G i is the subset of V from which there are directed edges to v i . Any G that is a DAG corresponds to an ordering of the nodes, given by the ordered set {v σ i }, i ∈ {1, 2, . . . , p} , where σ is a permutation of [p] -the ordered set of first p natural numbers, with σ (i) = σ i . A BN is said to consistent with an ordering One of the main methods for BN structure learning from data uses a scoring function that assigns a real value to the quality of G given the data. For finding a best network structure, we maximize this score over the space of possible networks. Note that we can have multiple best networks with the same score. Scoring functions balance goodness of fit to the data with a penalty term for model complexity. Some commonly used scoring functions are BIC/MDL [10], BDeu [11], and BGe [12,13]. We use BIC (Bayesian information criterion) and BGe (Bayesian Gaussian equivalent) scoring functions as two options for using Causnet. BIC is a log-likelihood (LL) score where the overfitting is avoided by using a penalty term for the number of parameters in the model, specifically p ln(n) , where n is the sample size. The BGe score is the posterior probability of the model hypothesis that the true distribution of the set of variables is faithful to the DAG model, meaning that it satisfies all the conditional independencies encoded by the DAG, and is proportional to the marginal likelihood and the graphical prior [12,13].

CausNet
CausNet uses the dynamic programming (DP) approach to finding a best Bayesian network structure for a given dataset. The idea of using dynamic programming for exact Bayesian network structure discovery was first proposed by Koivisto and Sood [14,15]. Recent work using dynamic programming, includes that by Silander and Myllymäki in [6] and by Singh and Andrew in [7].
We closely follow the algorithm proposed by Silander and Myllymäki (SM algorithm henceforth) and make heuristic modifications focusing on finding sparse networks and disease pathways. This is achieved by dimensionality reduction with parent set identification, and by restricting the search to the space of 'generational' orderings rather than lexicographical orderings as in the original SM algorithm.
Finding a best Bayesian network structure is NP-hard [5]. The number of possible structures for n variables is O n!2 ( n 2 ) [16], making exhaustive search impractical. So the dynamic programming algorithms for optimal BNs are feasible only for a small numbers of features, usually less than 30 [6,7]. These approaches are optimal in the sense that they are guaranteed to find a network with the best score. The number of variables can be increased somewhat by using small in-degree (maximum number of parents for any node), in which case the approach is optimal conditional on the constraint. However, bounding the in-degree by a constant k does not help much, the lower bound for the number of possible graphs is still n!2 kn log n (for large enough n) [14]. We introduce parent set identification and 'generational' orderings based search to reduce the search space and thus scale up the SM algorithm to a substantially larger numbers of variables. The SM algorithm uses the key fact about DAGs that every DAG has at least one sink, which is a node with no outgoing edges. The problem of finding a best Bayesian network given the data D starts with finding a best sink for the whole set of nodes. That node is removed and the process is then repeated recursively for the remaining set of nodes, which makes it a dynamic programming algorithm. The result is an ordering of the nodes {v σ i }, i ∈ {1, 2, .., p} , from which the DAG can be recovered. Denoting the best sink by s, and the score of a best network with nodes V by bestscore(V), and the best score of s with parents in U by bestScore(s, U), the recursion is given by the following relation: where V \ {s} denotes the set difference between the variable set V and the sink s.
To implement the above recursion, the idea of a local score for a node v i with parents parents(v i ) is used, which we get using a scoring function. The requirement for a score function score(G) for a network G is that it should be decomposable, meaning that the total score of the network is the sum of scores for each node in the network, and the score of a node depends only on the node and its parents. Formally, where the local scoring function localscore(x, y) gives the score of x with parents y in the network G. In a given set of possible parents pp i for node v i , we find the best set of parents bps i which give the best local score for v i , so that Now the best sink s can be found by Eq. 5, and the best score for a best network in V can be found by Eq. 6.
In Fig. 1, the subset lattice shows all the paths that need to be searched to find a best network. Observe that each edge in the lattice encodes a sink, so that each path also encodes an ordering on the 4 variables, e.g. the rightmost path encodes the reverseordering {1, 2, 3, 4} . There are a total of 4! paths/orderings to be searched. Now suppose we knew the best score corresponding to each edge in all the paths, meaning the best score for the sink corresponding to that edge with the best parents from the subset at the source of that edge. Then naive depth/width first search would compare the sum of scores along all paths to get the best network. In the SM approach, we proceed from the top of the lattice. Ignoring the empty set, we start with finding the best sink for each singleton in the first row which trivially is the singleton itself. Next, we find the best sink for the subsets of cardinality 2 in the second row using the edge best scores. And we continue all the way down. Suppose we get the best sinks sublattice as in Fig. 2, then the best network is given by the only fully connected path, and corresponds to the reverseordering {4, 1, 2, 3}. Now, using the SM algorithm, finding the best Bayesian network structure, also the basic CausNet approach without restricting the search space, has the following five steps: 1. Calculate the local scores for all p2 p−1 different (variable, variable set)-pairs. 2. Using the local scores, find best parents for all p2 p−1 (variable, possible parent set)pairs. 3. Find the best sink for all 2 p variable sets. 4. Using the results from Step 3, find a best ordering of the variables. 5. Find a best network using results computed in Steps 2 and 4.
The extensions to this base version of CausNet-possible parent sets identification, phenotype driven search, and search space based on 'Generational orderings' reduce the search space.

Possible parent sets identification
The possible parent set pp i for each node V i , such that pp i ⊆ U i ⊆ V \ {V i } , is determined in a preliminary step using marginal association. For this, we test for pairwise association between variables using Pearson's product moment correlation test. Either a specifiable p value α of the test is used as False Discovery Rate (FDR) cut-off to identify associations between pairs or we can use correlation value cutoffs. The choice of a target FDR level can also be made in a post hoc fashion and can be determined by such factors as strength of evidence for observed associations, investigators tolerance for complexity versus need for interpretable results, and cost of follow-up studies [17]. Decreasing the FDR or increasing the correlation cutoff reduces the number of nodes to be considered, thus leading to sparser networks.
This gives a possible parent set pp i for each node V i . If the response is a survival outcome, we evaluate associations using Cox proportional hazards regression, independently for each feature. Thus, we make the practical approximation that if there is no detectable marginal association, then the feature is unlikely to have a causal input that is meaningful or at least detectable in the data. Therefore, the feature without evidence of marginal association need not be included as a possible parent.

Phenotype driven search
In biomedical applications, one is often interested in a small set of predictors that affect a phenotype of interest. Diffusion-based prioritization of genes as risk predictors leading to an outcome has been shown to identify a subnetwork of interest [18,19]. A one-hop or k-hop approach is shown to find such networks [18]. We apply a similar approach in our algorithm with a 'phenotype driven search' . In this approach, we consider only two or three levels of associations (similar to 1-hop and 2-hop respectively in [18]) starting with the outcome variable, i.e. we identify the parents, 'grandparents' and 'great-grandparents' of the phenotype outcome.
Identifying variables with evidence of association as defined by the threshold gives us the "feasible set" (feasSet) of nodes. The original data is then reduced to the feas-SetData, which includes only the feasSet variables. This is implemented as in Algorithm 1. After Algorithm 1, the dimension of data is reduced to p , p < p , where p is the number of nodes in the feasSet.

Dynamic programming on the space of 'Generational orderings' of nodes
After the possible parent sets are identified, the next step is to compute local scores for feasSet nodes as shown in Algorithm 2. Computing local scores has computational complexity O p2p −1 if there was no possible parent sets identified for nodes, but reduces to O pr d , where r is the maximum cardinality of possible parent sets of all nodes and d is the in-degree. Because of bounded indegree, the step at line 3 in this algorithm is truncated at cardinality indegree. The next step is to compute the best scores and best parents for feasSet nodes in all possible parent subsets as shown in Algorithm 3. This uses local scores already calculated in Algorithm 2. This has computational complexity O pr2 r−1 , where r is the maximum cardinality of possible parent sets of all nodes.
Then we compute best sinks for all possible subsets of feasSet nodes restricted by the parent set constraints. Now compared with the SM algorithm that explores all orderings, we explore only the space of what we call 'generational orderings' to find the best BNs. Recall from the Background section above that a DAG corresponds to an ordering of the nodes {v σ i } with the constraints that parents of a node v σ i in the ordering are a subset of {v σ j } , with j < i , i.e. each node can have parents only from the set of nodes above itself in the ordering. To include the parent set constraints, we search only the orderings that are consistent with the possible parent sets for each node. The resulting subset of orderings is the space of 'generational orderings'. Definition 3.1 A generational ordering is an ordering such that each variable in the ordering has at least one parent from the set of possible parents in the set of variables preceding it in the ordering, consistent with the possible parent sets.
Starting with p networks of single nodes, we add one offspring at a time, and iterate over all nodes in the set to find a best sink in a subset of cardinality increasing from 1 to p , as in algorithm 4. Observe that while finding best sink for a subset of cardinalty k at level k, we already have the best networks of cardinalty k − 1 at level k − 1 , which is the key feature of the DP approach. Once we have these best sinks, we compute the reverse ordering (possibly multiple orderings) of p nodes and compute best network as shown in algorithm 5.
node 4 as a possible parent in our example. The bottom lattice retains only the subsets that are in the complete path from the top of the lattice to its bottom, and discards the remaining paths. These full paths represent what we call complete generational orderings. This is how generational ordering of Causnet ensures maximum connectivity among the reduced set of p nodes in the feasSet. The red arrows which at base are blue as well, represent the best network.

Definition 3.2 A generational ordering is a complete generational ordering if it has all
the variables in the feasSet in the ordering.

Example 3.1
In the top subset lattice in Fig. 3

Simulations
We compare CausNet with three other methods that have been widely used for optimal Bayesian network identification to infer disease pathways from multiscale genomics data. The first method is Bartlett and Cussens' GOBNILP [20], an integer learning based method that's considered state-of-art exact method for finding optimal Bayesian network. The other two methods are BNlearn's Hill Climbing (HC) and Max-min Hill Climbing (MMHC) [21,22], which are both widely used approximate methods, see e.g. [23,24]. Hill-Climbing (HC) is a score-based algorithm that uses greedy search on the space of the directed graphs [25]. Max-Min Hill-Climbing (MMHC) is a hybrid algorithm [26] that first learns the undirected skeleton of a graph using a constraint-based algorithm called Max-Min Parents and Children (MMPC); this is followed by the application of a score-based search to orient the edges. We simulated Bayesian networks by generating an N x p data matrix of continuous Gaussian data. The dependencies are simulated using linear regression with the option to control effect sizes. Some number of the p nodes were designated as sources ( p 1 ), some intermediate ( p 2 ), and some sinks ( p 3 ), the remainder ( p 0 ) being completely independent. The actual DAGs of the p 1 + p 2 + p 3 nodes vary across replicates. The requirement for being a DAG is implemented using the idea of ordering of vertices. We pick a random ordering of a randomly chosen subset of p vertices. Then enforcing each vertex to have parents only from the set of vertices above itself in the ordering guarantees a DAG.
The False Discovery Rate (FDR) and Hamming Distance are used as the metrics to compare the methods. With FP defined as the number of false positives and TP defined as the number of true positives, FDR is defined as: Controlling for the false discovery rate (FDR) is a way to identify as many significant features (edges in case of BNs) as possible while incurring a relatively low proportion of false positives. This is especially useful metric for high dimensional data and for network analysis ( [27]).
The Hamming distance between two labeled graphs G 1 and G 2 is given by |{((e ∈ E(G 1 )&(e � ∈ E(G 2 )))or((e � ∈ E(G 1 )&e ∈ E(G 2 )))}| , where E(G i ) is the edge set of graph G i . Simply put, this is the number of addition/deletion operations required to turn the edge set of G 1 into that of G 2 . The Hamming distance is a measure of structural similarity, and forms a metric on the space of graphs (simple or directed), and gives a good measure of goodness of a predicted graph ( [28,29]). In the context of predicted and the truth graph, with FP defined as the number of false positives and FN as the number of false negatives, Hamming Distance is defined as: As the first set of simulations using parent set identification, we ran simulations using multiple replicates of networks, the first with p = 10, 20, 40, 50, 60, 100 , and N = 500, 1000, 2000 . Figure 4 shows the plot of average FDR for different values of p and N, and their linear trend with BIC scoring for CausNet. For using CausNet, We have used FDR cutoff of 0.3 and an in-degree of 2 for all the experiments.
We can see that for lower values of p, Gobnilp has the lowest values of FDR with CausNet the second best, but CausNet performs the best for higher values of p. The results for the BGe scoring for CausNet are similar qualitatively. In the Tables 1 and 2, we show the average FDR across the 9 combinations of N and p, both with and without taking directionality into account. The results for CausNet with the choice of scoring function-either BGE or BIC-are given. The results show that our method performs very well compared with the methods considered.
For the number of variables up to 40 (Table 1), on the metric of FDR, CausNet performs second best after Gobnilp for directed graphs and the best for undirected graphs. This is true for both scoring methods-BIC and BGE. For the number of variables between 50 and 100 (Table 2), our method performs the best for both directed and undirected graphs, using either of the two scoring methods, BIC and BGE. Figure 5 shows the plot of average Hamming Distance for different values of p and N, and their linear trend with BIC scoring for CausNet. We can see that for lower values of p, Gobnilp has the lowest values of Hamming Distance with CausNet the second best for most part, but CausNet performs the best for higher values of p. The results for the BGe scoring for CausNet are the same qualitatively. The FDR for phenotype-driven parent sets is the best for number of variables greater than 10; Gobnilp is the second best. In terms of Hamming distance too, Causnet again is the best for variables more than 20; Gobnilp and MMHC have lower   Hamming distancep than that of Causnet for variables less than 20, but rise quickly for variables more than 20; overall, Gobnilp is again the second best.

Number of variables up to 1000
For these simulations, we use p = 200, 500, 1000 , and N = 500, 1000, 2000 . For these simulations, we don't compare with the other three algorithms as they either can not handle such high number of variables or take many orders of magnitude longer. The results are shown in Figs. 8 and 9. Observe that both the FDR and Hamming Distance values are better than what we had with other methods when the number of variables was less than 100.

Runtime
Having confirmed the performance of CausNet as superior to other algorithms, especially for the number of variables greater than 40, we compare the runtimes of the four algorithms. Here we split the comparison into two categories -number of variables less than 100 and greater than 100, i.e p ≤ 100 and 100 < p ≤ 1000 . This is done as two of the algorithms -MMHC and Gobnilp -either can not handle more than 100 variables or the computation time is many orders of magnitude greater than that taken by CausNet. Specifically, MMHC takes over an average of 300 min, and Gobnilp terminates without producing any output network in most cases. Running all the simulations for these two algorithms would require an inordinate amount of time and was not considered  worth the effort. The average runtimes in seconds are summarized in Table 3. As we can clearly see, Causnet has the best runtimes for p ≤ 100 and Gobnilp the worst. For 100 < p ≤ 1000 , CausNet is better than Gobnilp and MMHC. Although HC is faster, sections above demonstrated it to be inferior to CausNet in terms of performance metrics of FDR and Hamming distance.

Application to clinical trial data
We applied our method to recently published data involving gene expression and ovarian cancer prognosis among participants in multiple clinical trials [30]. The aim of this study was to develop a prognostic signature based on gene expression for overall survival (OS) in patients with high-grade serous ovarian cancer (HGSOC). Expression of 513 genes, selected from a meta-analysis of 1455 tumors and other candidates, was measured using NanoString technology from formalin-fixed paraffin-embedded tumor tissue collected from 3769 women with HGSOC from multiple studies. Elastic net regularization for survival analysis was applied to develop a prognostic model for 5-year OS, trained on 2702 tumors from 15 studies and evaluated on an independent set of 1067 tumours from six studies. Results of this study showed that expression levels of 276 genes were associated with OS (false discovery rate < 0.05 ) in covariate-adjusted single-gene analyses. We applied our method CausNet to this gene expression dataset of 513 genes and survival outcome. For dimensionality reduction and parent set identification, we used a three-level phenotype driven search. For the disease node 'Status' , dead or alive at censoring or end of follow-up, we carried out Cox proportional hazard regression for each gene separately, adjusted for age, stage, and stratified site. Then we computed analysis of variance tables for the fitted models and created a list of p values based on the χ 2 distribution. FDR was then computed using the Benjamini & Hochberg (BH) method. We choose the 5 genes, ZFHX4, TIMP3, COL5A2, FBN1, and COL3A1, with the most significant p-values as possible parents of the disease node. At the next level, we used correlations between these and rest of the genes to identify possible parent sets. Parent sets were also identified for possible grand-parents of the disease node. These three levels of ancestors of the disease node resulted in 16 genes with non-null parent sets. This process substantially reduced the dimensionality of the dataset. The sets of possible parents were then used by CausNet with the BIC score to find a best network. Using the in-degree of 2, we identified a best network as shown in Fig. 10. On a personal computer with a 2.3 GHz Intel Core i9 processor with 16 GB RAM, this processing took about Fig. 10 Ovarian cancer network 3.4 min. As no other methods in our study or otherwise work for survival outcome, our method is the only method able to handle such large data, and survival outcome in such short runtime to produce an optimal BN.

Discussion
We implemented a dynamic programming based optimal Bayesian network (BN) structure discovery algorithm with parent set identification with 'generational orderings' based search for optimal networks, which is a novel way to efficiently search the space of possible networks given the possible parent set.
Our main novel contribution aside from providing software is the revision of the SM Algorithm 3 [6] to incorporate possible parent sets and 'generational orderings' based search for a more efficient way to explore the search space as compared to the original approach based on lexicographical ordering. In doing so, we cover the entire constrained search space without searching through networks that don't conform to the parent set constraints. While the basic algorithm can be applied to any dataset from any domain, the phenotype based algorithm is particularly suitable for disease outcome modeling.
The simulation results show that our algorithm performs very well when compared with three state-of-art algorithms that are widely used currently. The parent set constraints reduce both the search space and the runtime significantly, while delivering better results, especially for greater than 60 variables.
The application to the recently published ovarian cancer gene-expression data with survival outcomes showed the algorithm's usefulness in disease modeling. It yielded a sparse network of 6 nodes leading to the disease outcome from gene expression data of 513 genes in just 3.4 min on a personal computer with a 2.3 GHz Intel Core i9 processor with 16 GB RAM. This disease pathway may generate guiding insights and hypotheses for further biomedical studies. Important features of the algorithm include specifiable parameters-correlation, FDR cutoffs, and in-degree-which can be tuned according to the application domain. Choice of two scoring options, BIC and Bge, and implementation of survival outcomes and mixed data types makes our algorithm suitable for identifying disease pathways from a broad range of biomedical data types, e.g. GWAS and other omics.